A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation

Authors

  • Olfa Nasraoui
  • Carlos Rojas
  • Cesar Cardona
Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may furthermore bring some interesting surprises in the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure, MinPC, that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the MinPC similarity generally performs better than the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios. © 2005 Elsevier B.V. All rights reserved.
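
The contrast the abstract draws between MinPC and cosine similarity can be illustrated with a small sketch. The abstract does not give the formula for MinPC, so the code below assumes it is the minimum of precision and coverage computed on binary item sets (a profile and a session, each a set of URLs). The function names and the sample data are hypothetical and chosen only for illustration.

```python
import math


def precision(profile: set, session: set) -> float:
    """Fraction of the profile's items that also occur in the session."""
    return len(profile & session) / len(profile) if profile else 0.0


def coverage(profile: set, session: set) -> float:
    """Fraction of the session's items that the profile accounts for."""
    return len(profile & session) / len(session) if session else 0.0


def min_pc(profile: set, session: set) -> float:
    """ASSUMPTION: MinPC taken as min(precision, coverage)."""
    return min(precision(profile, session), coverage(profile, session))


def cosine(profile: set, session: set) -> float:
    """Cosine similarity between two binary (set-valued) vectors."""
    if not profile or not session:
        return 0.0
    return len(profile & session) / math.sqrt(len(profile) * len(session))


def jaccard(profile: set, session: set) -> float:
    """Jaccard similarity: shared items over all distinct items."""
    union = profile | session
    return len(profile & session) / len(union) if union else 0.0


# Hypothetical example: a small profile fully contained in a long session.
profile = {"/a", "/b"}
session = {"/a", "/b", "/c", "/d", "/e", "/f", "/g", "/h"}

print("precision:", precision(profile, session))
print("coverage: ", coverage(profile, session))
print("MinPC:    ", min_pc(profile, session))
print("cosine:   ", cosine(profile, session))
print("Jaccard:  ", jaccard(profile, session))
```

For binary sets, cosine similarity equals the geometric mean of precision and coverage, so a perfect precision can partly mask a poor coverage. Under the assumption above, the minimum is bounded by the weaker of the two criteria, which is one way to read the abstract's claim that the proposed measure seeks high precision and high coverage simultaneously.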


Similar resources

Collaborative Filtering in Dynamic Streaming Environments

The increasing expansion of websites and their web usage necessitates increasingly scalable techniques for Web usage mining that can be better cast within the framework of mining evolving data streams [1, 5]. Despite recent developments in mining evolving Web clickstreams [3, 6], there has not been any investigation of the performance of collaborative filtering [2] in the demanding environment ...


Mining Evolving Web Clickstreams with Explicit Retrieval Similarity Measures

Data on the Web is noisy, huge, and dynamic. This poses enormous challenges to most data mining techniques that try to extract patterns from this data. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, and without any unnecessary stoppages and reconfigurations is still an open challenge. This dynam...


From Tweets to Stories: Using Stream-Dashboard to weave the twitter data stream into dynamic cluster models

Social media has recently emerged as an invaluable source of information for decision making. Social media information reflects the interests of virtual communities in a spontaneous and timely manner. The need to understand the massive streams of data generated by social media platforms, such as Twitter and Facebook, has motivated researchers to use machine learning techniques to try to discove...


TECNO-STREAMS: Tracking Evolving Clusters in Noisy Data Streams with a Scalable Immune System Learning Model

Artificial Immune System (AIS) models hold many promises in the field of unsupervised learning. However, existing models are not scalable, which makes them of limited use in data mining. We propose a new AIS based clustering approach (TECNO-STREAMS) that addresses the weaknesses of current AIS models. Compared to existing AIS based techniques, our approach exhibits superior learning abilities, ...


Mining and tracking evolving web user trends from large web server logs

Recently, online organizations became interested in tracking users’ behavior on their websites to better understand and satisfy their needs. In response to this need, web usage mining tools were developed to help them use web logs to discover usage patterns or profiles. However, since website usage logs are being continuously generated, in some cases, amounting to a dynamic data stream, most ex...



Journal:
  • Computer Networks

Volume 50


Publication date: 2006